Semantic image segmentation is an essential component of modern autonomous driving systems, as an accurate understanding of the surrounding scene is crucial to navigation and action planning. Current state-of-the-art approaches in semantic image segmentation rely on pre-trained networks that were initially developed for classifying images as a whole. While these networks exhibit outstanding recognition performance (i.e., what is visible?), they lack localization accuracy (i.e., where precisely is something located?). Therefore, additional processing steps have to be performed in order to obtain pixel-accurate segmentation masks at the full image resolution. To alleviate this problem we propose a novel ResNet-like architecture that exhibits strong localization and recognition performance. We combine multi-scale context with pixel-level accuracy by using two processing streams within our network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals. Without additional processing steps and without pre-training, our approach achieves an intersection-over-union score of 71.8% on the Cityscapes dataset.
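The two-stream coupling described above can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): the pooling stream operates at reduced resolution, and its features are upsampled and added to the full-resolution stream as a residual. The function names and the identity placeholder for the convolutional processing are assumptions for illustration only.

```python
import numpy as np

def pool2x(x):
    # 2x2 max pooling on a (H, W) feature map: the pooling stream's downsampling
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x(x):
    # nearest-neighbour upsampling back to full resolution
    return x.repeat(2, axis=0).repeat(2, axis=1)

def coupled_unit(full_res, pooled):
    # The pooling stream would process coarse features here (identity is a
    # placeholder for its convolutions); the result is upsampled and added
    # to the full-resolution stream as a residual, coupling the two streams.
    processed = pooled
    full_res = full_res + upsample2x(processed)
    return full_res, processed

x = np.arange(16, dtype=float).reshape(4, 4)  # toy full-resolution input
pooled = pool2x(x)                            # pooling stream at half resolution
full_res, pooled = coupled_unit(x, pooled)
print(full_res.shape, pooled.shape)           # full-res stream stays at (4, 4)
```

The key property this toy example preserves is that the full-resolution stream never loses spatial resolution, so segment boundaries can be followed precisely, while the pooled stream provides the robust, coarse recognition features.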